Abstract

In this paper, we establish the convergence of the Optimal AdaBoost classifier under mild 
conditions. We frame AdaBoost as a dynamical system, and provide sufficient conditions 
for the existence of an invariant measure. Employing tools from ergodic theory, we show 
that the margin for every example converges. More generally, we prove that the time 
average of any function of the weights over the examples converges. If the weak learner 
satisfies some common conditions, the generalization error does not change much in the
limit. We conjecture that these conditions are satisfied on almost every dataset, and show 
preliminary empirical evidence in support of that conjecture. 

Keywords: AdaBoost, convergence dynamics, generalization error, margins, boosting 



1. Introduction 



Leo Breiman once called AdaBoost (Freund and Schapire, 1997) the best off-the-shelf
classifier for a wide variety of datasets (Breiman, 1999). Fourteen years later, AdaBoost is
still widely used because of its simplicity, speed, and theoretical guarantees for good
performance. However, despite its overwhelming popularity, there is still a mystery
surrounding its generalization performance (Mease and Wyner, 2008).

On each iteration of AdaBoost a new hypothesis, generated by a weak learning algorithm,
is added to a running linear combination of hypotheses. Intuitively this combination
of hypotheses is increasing in complexity the longer the algorithm is run. Meanwhile, the
generalization performance of this ensemble tends to improve or remain stationary after
a large number of iterations, contradicting standard VC-Dimension based bounds (Freund
and Schapire, 1996; Drucker and Cortes, 1995; Breiman, 1998; Quinlan, 1996). In some
cases, the generalization error continues to decrease long after the training error of the
chain has reached zero (Schapire et al., 1998). A common graph depicting this behavior is
Figure 1. Remarkably, the complicated combination of 1000 trees generalizes better than
the simpler combination of 10.

Figure 1: Training and test error (y-axis) for AdaBoosting C4.5 on the letter dataset for
up to 1000 rounds (x-axis, in log-scale). This plot, which originally appeared in
Schapire et al. (1998), is still featured in many tutorials and talks, but without
any definitive formal explanation (see Breiman (1999) and Grove and Schuurmans
(1998) for experimental evidence against the "max-margin theory" originally put
forward as an explanation for this behavior by Schapire et al. (1998)). We believe
our contribution in this paper provides a rather convincing, formal explanation
for this behavior.

Solving this paradox has been a driving force behind boosting research, and various
explanations have been proposed. By far the most popular among them is the theory of
margins. The generalization error of any convex combination of functions can be bounded by
a function of their margins on the training examples, independent of the number of classifiers
in the ensemble. AdaBoost provably produces large margins, and tends to continue to
improve the margins after the training error has reached zero (Schapire et al., 1998).



The margin theory is effective at explaining AdaBoost's generalization performance at a
high level. But it still has its downsides. There is evidence both for and against the power of
the margin theory to predict the quality of the generalization performance (Breiman, 1999;
Rudin et al., 2004, 2007). But the most striking problem is that the margin bound is very
loose: it does not explain the precise behavior of the error. For example, when looking at
Figure 1, a couple of questions arise. Why is the generalization error not fluctuating wildly
underneath the bound induced by the margin? Or even, why is the generalization error not
approaching the bound? Remarkably, the error does neither of these things, and seems to
converge.

This phenomenon is not unique to this dataset. We can see this convergence on many 
different datasets, both natural and synthetic. Even in cases where AdaBoost seems to be 
overfitting, the generalization performance tends to converge eventually. Take for example 
Figure 2. For the first 5000 rounds it appears that the algorithm is overfitting. Afterwards,
its generalization error converges. 

Figure 2: Test error (y-axis) of AdaBoosting decision stumps on the Heart-Disease
dataset (Frank and Asuncion, 2010b) for up to 20,000 rounds (x-axis). Note
the same converging behavior exhibited for the letter dataset in Figure 1. This
converging behavior is typically observed in empirical studies of AdaBoost.

To account for this convergent behavior, Breiman (2001) conjectured that AdaBoost
was an ergodic dynamical system. He argued that if this was the case, then the dynamics
of the weights over the examples behaves like selecting from some probability distribution.
Therefore AdaBoost can be treated as a random forest. Using the strong law of large
numbers, it follows that the generalization error of AdaBoost converges for certain weak
learners.

In this paper, we follow a similar approach. We frame AdaBoost as a dynamical system
(Rudin et al., 2004). From here we establish sufficient conditions for an invariant
measure on this dynamical system. We do not require this measure to be ergodic, so this is
weaker than Breiman's requirement. Using tools from ergodic theory, we show that such a 
measure implies the convergence of the time average of any function of the weights over the 
examples. In particular, this shows that the margin for every example converges. We also 
show that the AdaBoost classifier itself is converging if the weak learner satisfies certain 
conditions. Ultimately we prove our main result: the generalization error does not change 
much in the limit. We hope these results shed light on "the most important open problem 
in machine learning" (Breiman, 2002). 



2. Related Work 

Despite the flurry of convergence results for AdaBoost within the last 10 years, we believe
our result is unique. We refer the reader to Schapire and Freund (2012) for a textbook
account of the state-of-the-art in AdaBoost research.

In this section, we briefly discuss and contrast those contributions we believe are closest
to ours of which we are aware. It is important to note that a number of these convergence
results concern variants of AdaBoost, such as regularized boosting (see, e.g., Rosset et al.
(2004), Lozano et al. (2006), Xi et al. (2009), and the references therein). Those convergence
results that do concern AdaBoost in its original form show convergence for other aspects
of the algorithm like the exponential loss, as discussed in the next paragraph. We will now
walk through some of the recent research on AdaBoost.

A bulk of the asymptotic analysis on AdaBoost has focused on how it minimizes the
exponential loss. AdaBoost can be viewed as a coordinate descent algorithm that iteratively
minimizes the exponential loss (Breiman, 1999; Mason et al., 2000; Friedman et al., 1998).
Under the weak learning condition, this minimization procedure is well understood, and
has a quick rate. Later AdaBoost was shown to minimize the exponential loss even without
the weak learning assumption (Collins et al., 2002; Zhang and Yu, 2005), but no rates were
provided. Finally, Mukherjee et al. (2011) proved that AdaBoost enjoys a rate polynomial
in $1/\epsilon$. Telgarsky (2012) achieves a similar result by exploring the primal-dual
relationship implicit in AdaBoost. These results all concern the convergence of the exponential
loss. Meanwhile, in this paper we are interested in the convergence of the basic Optimal
AdaBoost classifier itself, along with its margins, and any time-averaged function of its weights.

Consistency is another asymptotic concern about the behavior of AdaBoost. There are a
number of papers that show that variants of AdaBoost are consistent (Zhang, 2004; Lugosi
and Vayatis, 2004; Zhang and Yu, 2005), or that AdaBoost is consistent under certain
assumptions (Peter J. Bickel, 2006). Bartlett and Traskin (2007) show that AdaBoost is
consistent if stopped at time $n^{1-\epsilon}$ for $\epsilon \in (0, 1)$, where $n$ is the number
of examples in the training set. Consistency is distinct from the notion of convergence in this
paper. An algorithm is consistent if its generalization error approaches the Bayes risk in the
limit of the number of examples in the training set. Here our concern is the generalization
error in the limit of the number of iterations of the algorithm on a fixed sample.

Merler et al. (2007) provide empirical evidence for the statistical regularity underlying
AdaBoost.

Others have also approached the study of AdaBoost from a dynamical-system perspective.
Rudin et al. (2004) pioneered this approach, demonstrating that AdaBoost enters
cycles in many low dimensional cases. They proved that when AdaBoost cycles, its
asymptotic behavior can be fully understood: the minimum margin converges, and many of the
results of this paper can be demonstrated. However, little is understood in the non-cyclic
case. Chaotic non-cyclic behavior occurs on some higher dimensional cases: typical
behavior for AdaBoost on large real world datasets. In fact, whether AdaBoost always cycles
remains an open question to this date (Rudin et al., 2012).

We view our paper as an extension to the cyclic behavior of AdaBoost, as many of the 
results are the same, but our work encompasses non-cyclic, chaotic behavior. 



3. Background and Notation 

We make the assumption that all samples are taken from a probability space. The set of
possible examples will be denoted $D = X \times \{0, 1\}$, where $X$ are the instances and
$\{0, 1\}$ are their labels. We will view $D$ as part of a probability space, denoted
$\mathcal{D} = (D, \Sigma, P)$. $\Sigma$ is the set of possible events, and $P$ is a probability
measure mapping $\Sigma \to \mathbb{R}$. Samples taken from $\mathcal{D}$ will be denoted $S$,
with $S \subset X$ and $S = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$. The set of weak
hypotheses that AdaBoost will be provided is denoted by $\mathcal{H}$, and may be finite or
infinite. This set of hypotheses induces a set of dichotomies on a training set $S$. We denote
this set of dichotomies as $\mathrm{Dich}(\mathcal{H}, S) = \{h^{(1)}, h^{(2)}, \ldots, h^{(n)}\}$.
An element $h$ of $\mathrm{Dich}(\mathcal{H}, S)$ is treated as a 0-1 row vector, with
$h^{(i)}(j)$ being the label hypothesis $h^{(i)}$ gives example $x^{(j)}$. From this set we can
derive an error matrix. This matrix has the form $M(i, j) = 1[y(j) \neq h^{(i)}(j)]$.
A row $i$ of the matrix can be thought of as a 0-1 vector indicating where dichotomy $i$ is
incorrect on the set $S$. We will abuse notation and treat $M$ as a set. When we do this, the
set elements are the rows of the matrix; we will call these rows dichotomies, and we denote
them by $\eta$. For example, we often state "for all $\eta \in M$". By this we mean "all rows
$\eta$ in $M$".
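The construction of the error matrix can be sketched in a few lines of Python. This is a minimal illustration, assuming that labels and dichotomies are given as 0-1 lists and that an entry of 1 marks a mistake, as the description of the rows suggests. The function name `error_matrix` and the toy data are hypothetical.

```python
def error_matrix(dichotomies, labels):
    """Return M with M[i][j] = 1 exactly when dichotomy i mislabels example j."""
    return [[int(h_j != y_j) for h_j, y_j in zip(h, labels)]
            for h in dichotomies]


# Hypothetical toy data: n = 3 dichotomies over m = 4 labeled examples.
labels = [1, 0, 1, 1]
dichotomies = [
    [1, 0, 0, 1],  # wrong only on the third example
    [0, 0, 1, 1],  # wrong only on the first example
    [1, 1, 1, 0],  # wrong on the second and fourth examples
]
M = error_matrix(dichotomies, labels)
for row in M:
    print(row)
```

Each printed row is one dichotomy in the sense used in the rest of the paper: a 0-1 vector over the training examples marking its mistakes.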



4. AdaBoost as a Dynamical System 

This paper studies Optimal AdaBoost as a dynamical system of the weights over the
examples, in a way similar to Rudin et al. (2004). Here we break down the components of the
AdaBoost algorithm and frame it as such a system. We will fix $\mathcal{H}$ and $S$, therefore
fixing $\mathrm{Dich}(\mathcal{H}, S)$ and $M$. The state space of the system is the standard
$m$-simplex, denoted

$$\Delta_m = \left\{ w \in \mathbb{R}^m \;\middle|\; \sum_{i=1}^m w(i) = 1 \text{ and for all } i,\ w(i) \geq 0 \right\}.$$

We will often denote elements of $\Delta_m$ as $w$.

AdaBoost extensively uses the weighted error of a hypothesis in its weight update. The
typical notion for this is

$$\mathrm{err}(h, w) = \sum_{i=1}^m w(i)\, 1[h(x_i) \neq y_i].$$

However, for much of our analysis we will reduce AdaBoost to only using the rows of $M$ in
its weight update. When $\eta \in M$, we denote the error as

$$\mathrm{err}(\eta, w) = \eta \cdot w.$$

As we are working with Optimal AdaBoost, we need a notion of "best" row in $M$ for
any weight $w \in \Delta_m$. The standard notion for best row is any in
$\arg\min_{\eta \in M} \eta \cdot w$. However, multiple rows may be in that set. We need a
policy for how we break ties between rows with the lowest error, so we assume there exists
such a tie-breaking function $\mathrm{AdaSelect} : \mathcal{P}(M) \to M$.

Definition 1 Given a weight $w \in \Delta_m$, our notion for best row in $M$ for $w$ is defined as

$$\eta^w = \mathrm{AdaSelect}\left( \arg\min_{\eta \in M} \eta \cdot w \right).$$

This selection procedure naturally partitions $\Delta_m$ into regions where different rows are
best, in the sense that they would be selected by AdaSelect.

Definition 2 Given some arbitrary $\eta \in M$, define

$$\sigma^*(\eta) = \{ w \in \Delta_m \mid \eta = \eta^w \}.$$

Note that $\sigma^*(\eta)$ may be open or closed for different $\eta$'s, depending on how ties are
broken using AdaSelect. The closure of this set will also play an important role.

Definition 3 Given some arbitrary $\eta \in M$, define

$$\sigma(\eta) = \left\{ w \in \Delta_m \;\middle|\; \eta \in \arg\min_{\eta' \in M} \eta' \cdot w \right\}.$$

The set $\sigma(\eta)$, being the closure of $\sigma^*(\eta)$, is naturally closed. However, these
sets no longer form a partition of $\Delta_m$. Given two distinct dichotomies
$\eta_1, \eta_2 \in M$, it is possible that $\sigma(\eta_1) \cap \sigma(\eta_2) \neq \emptyset$.

Sometimes it is convenient to consider only the subset of $\Delta_m$ where every row in $M$ has
non-zero error.

Definition 4 The set of all weights in $\Delta_m$ with non-zero error on all dichotomies is defined
as

$$\sigma_0 = \{ w \in \Delta_m \mid \eta \cdot w > 0 \text{ for all } \eta \in M \}.$$


We will depart from standard notation for the AdaBoost weight update. The notation
we use will be more convenient for the main proofs in this paper. First, we have a notion
of a hypothetical weight update. That is, given $w \in \Delta_m$, if we assume that
$\eta = \eta^w$, where would the AdaBoost weight update take $w$?

Definition 5 Given an arbitrary row $\eta \in M$, we define $T_\eta : \Delta_m \to \Delta_m$
component-wise as

$$[T_\eta(w)]_i = \frac{1}{2}\, w(i) \times \left( \frac{1}{\eta \cdot w} \right)^{\eta(i)} \left( \frac{1}{1 - \eta \cdot w} \right)^{1 - \eta(i)}.$$



$T_\eta$ certainly does not trace out the actual trajectory of the AdaBoost weights. The true
update first finds the best row $\eta^w$, and then applies $T_{\eta^w}(w)$.

Definition 6 The AdaBoost weight update is $A : \Delta_m \to \Delta_m$, defined as

$$A(w) = T_{\eta^w}(w).$$

1. The form of the update we use has previously appeared in Grove and Schuurmans (1998); Oza (2001);
Rudin et al. (2004).

We can now trace the trajectory of the AdaBoost weights by repeatedly applying $A$ to
an initial weight picked within $\Delta_m$. More specifically, if $w_1 \in \Delta_m$ is taken as
some initial point, we can rederive any $w_t$ in our original formulation of the algorithm with

$$w_t = A^{(t-1)}(w_1) \qquad (1)$$

where $A^{(t-1)}$ denotes composing $A$ with itself $t - 1$ times.
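The iteration of the weight update can be sketched as follows. The code is only an illustration under stated assumptions: `ada_select` stands in for AdaSelect by breaking ties in favor of the lowest row index, and the update uses the usual AdaBoost normalization, in which the examples the chosen row gets wrong end up with total weight 1/2. All function names are hypothetical.

```python
def weighted_error(eta, w):
    """err(eta, w) = eta . w for a 0-1 row eta and a weight vector w."""
    return sum(e * wi for e, wi in zip(eta, w))


def ada_select(M, w):
    """Assumed tie-breaking policy: the lowest-index row among the minimizers."""
    errors = [weighted_error(eta, w) for eta in M]
    return errors.index(min(errors))


def update(eta, w):
    """Assumed update for row eta: mistakes are scaled by 1 / (2 * eps) and
    correct examples by 1 / (2 * (1 - eps)), so the result sums to 1."""
    eps = weighted_error(eta, w)
    if not 0.0 < eps < 1.0:
        raise ValueError("update is undefined when the weighted error is 0 or 1")
    return [wi / (2 * eps) if e else wi / (2 * (1 - eps))
            for e, wi in zip(eta, w)]


def trajectory(M, w1, t):
    """Return [w_1, ..., w_t], each weight being A applied to the previous one."""
    weights = [list(w1)]
    for _ in range(t - 1):
        w = weights[-1]
        weights.append(update(M[ada_select(M, w)], w))
    return weights


# Toy error matrix: each of three dichotomies errs on exactly one example.
M = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ws = trajectory(M, [1 / 3, 1 / 3, 1 / 3], 4)
for w in ws:
    print(w)
```

Under these assumptions every weight vector in the list stays on the simplex, and the row just selected has weighted error 1/2 under the updated weights.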

We can also derive many of the parameters calculated by AdaBoost solely in terms of $w_t$.

Definition 7 The following are functions defined from $\Delta_m$ to $\mathbb{R}$.

1. $\epsilon(w) = \min_{\eta \in M} \eta \cdot w$

2. $\alpha(w) = \frac{1}{2} \log\left( \frac{1 - \epsilon(w)}{\epsilon(w)} \right)$

Fact 8 The following equalities hold.

1. $\epsilon_t = \epsilon(w_t) = \epsilon(A^{(t-1)}(w_1))$

2. $\alpha_t = \alpha(w_t) = \alpha(A^{(t-1)}(w_1))$

3. $\eta_t = \eta^{w_t} = \eta^{A^{(t-1)}(w_1)}$

The parameters described in Fact 8 will be called secondary parameters, because they
can be derived solely from the weight trajectory. We seek to understand the statistical
convergence properties of these secondary parameters, and the properties of the mapping
$A$ that causes such behavior.

5. Convergence of the AdaBoost Classifier 

As mentioned at the end of the previous section, the secondary parameters of AdaBoost can 
be written as functions based solely on the trajectory of $A$ applied to some initial $w_1 \in \Delta_m$.
Evidence suggests that not only are these averages converging, but the AdaBoost classifier 
itself is converging. The study of the convergence of the classifier, and its implications, is 
the main goal of this section. 

Recall that AdaBoost classifies examples by computing

$$\mathrm{sign}(F_T(x)) = \mathrm{sign}\left( \sum_{t=1}^T \alpha_t h_t(x) \right).$$

Carefully note that in this case, the $h_t$ correspond to hypotheses in $\mathcal{H}$, not
dichotomies in $M$. Certainly for $x \in S$, $F_T(x)$ does not converge as $T$ approaches
infinity (in fact, it must be growing at least linearly in $T$ if the weak-learning assumption
holds). So what do we mean by convergence of the AdaBoost classifier? We can replace
$\mathrm{sign}(F_T(x))$ with $\mathrm{sign}(\frac{1}{T} F_T(x))$ or
$\mathrm{sign}(\mathrm{margin}_T(x))$ in our notion of classification. Then, under certain
assumptions, if either $\frac{1}{T} F_T(x)$ or $\mathrm{margin}_T(x)$ converge for all $x$, so
does $\mathrm{sign}(F_T(x))$ and $\mathrm{sign}(\mathrm{margin}_T(x))$ respectively. It is in
this way that we mean "convergence of the AdaBoost classifier".

Using these results we can establish, under some reasonable conditions, convergence of 
the same functions on any x outside of the training set. The upshot is that given these 
kinds of convergence, we can say something strong about how the generalization error of the 
AdaBoost classifier behaves in the limit. Intuitively, if the AdaBoost classifier is effectively 
converging, so should its generalization error. This outlines the main contribution of this 
paper. 

Crucial to our understanding of this convergence is Birkhoff's Ergodic Theorem, stated
as Theorem 9 below. This theorem gives us sufficient conditions for statistical convergence,
which we will then apply to our secondary parameters. Taking center stage in this theorem
is the notion of a measure and a measure-preserving dynamical system. To be able to apply
Birkhoff's Ergodic Theorem, we need to show the existence of some measure $\mu$ such that
$(\Delta_m, \mathcal{B}, \mu, A)$ is a measure-preserving dynamical system. Please see the
appendix for a closer look at these topics. The existence of such a measure is given in
Proposition 14 but relies on Assumption 12. The context surrounding these is discussed in
greater detail shortly.

Equally important in Birkhoff's Ergodic Theorem is the notion of integrability, captured
by the notation $f \in L^1(\mu)$. This notation says that $f$ is integrable with respect to the
measure $\mu$. The precise meaning of this is that, first and foremost, $f$ is measurable. Second,
that

$$\int |f| \, d\mu < \infty.$$

If these two conditions hold, it follows that $f \in L^1(\mu)$. Proposition 15 shows us that
various parameters of AdaBoost are in $L^1(\mu)$, and can therefore be analyzed using Theorem 9.
We are now ready to introduce the theorem.



Theorem 9 (Birkhoff Ergodic Theorem) Suppose $A : X \to X$ is measure-preserving and
$f \in L^1(\mu)$ for some measure $\mu$. Then $\frac{1}{n} \sum_{k=0}^{n-1} f(A^k(x))$ converges
almost everywhere to a function $f^* \in L^1(\mu)$. Also $f^* \circ A = f^*$ almost everywhere
and if $\mu(X) < \infty$, then $\int f^* \, d\mu = \int f \, d\mu$.
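As a numerical illustration of the statement (not of AdaBoost itself), consider a standard textbook example: rotation of the unit interval by an irrational angle, which preserves Lebesgue measure. The sketch below, with hypothetical helper names and an arbitrarily chosen test function, compares the time average along one orbit with the integral of the function.

```python
import math


def time_average(f, step, x0, n):
    """Average f over the orbit x0, step(x0), ..., step^(n-1)(x0)."""
    total, x = 0.0, x0
    for _ in range(n):
        total += f(x)
        x = step(x)
    return total / n


ANGLE = (math.sqrt(5) - 1) / 2  # irrational rotation angle, chosen for the demo


def rotate(x):
    """Rotation of [0, 1) by ANGLE; it preserves Lebesgue measure."""
    return (x + ANGLE) % 1.0


def f(x):
    """Test function whose integral over [0, 1) equals 1/2."""
    return math.cos(2 * math.pi * x) ** 2


avg = time_average(f, rotate, x0=0.1, n=100_000)
print(avg)  # close to the space average 1/2
```

This particular map happens to be ergodic, so the limit is the same constant for every starting point. The theorem as stated only needs measure preservation, in which case the limit function may depend on the starting point.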

We care about the asymptotic behavior of AdaBoost, and want to disregard any of its 
transient states. Therefore we would like to look at a subset of its state space that the 
dynamics will limit towards, or stay within. The following set is, intuitively, the set of 
non-transient states of A. More specifically, it is the set of states that AdaBoost can reach 
for any time step t. 

Definition 10 $\Omega_\infty = \bigcap_{t=1}^{\infty} A^{(t)}(\Delta_m)$

The set $\Omega_\infty$ can be thought of as a trapped attracting set in the typical sense, because
$\Delta_m$ is compact and $A(\Delta_m) \subset \mathrm{Int}(\Delta_m)$. The continuity properties
of AdaBoost on this set are important to establish an invariant measure in a later section. But it
is difficult to say anything important about this set yet.
It turns out that there are discontinuities at many points in the state space. It is not
difficult to see that any point $w \in \Delta_m$ that yields more than one row in
$\arg\min_{\eta \in M} \eta \cdot w$ will be a discontinuity. We will call this a type 1
discontinuity. Similarly, any point that has $\eta^w \cdot w = 0$ will also be a discontinuity,
which we will call a type 2 discontinuity. In the following theorem, we establish that $A$ will be
continuous on any point besides those just mentioned.

Theorem 11 AdaBoost is continuous on all points $w$ such that
$w \in \mathrm{Int}(\sigma(\eta)) \cap \sigma_0$ for some $\eta$.

Proof Let $D = \mathrm{Int}(\sigma(\eta)) \cap \sigma_0$. Take any $w \in D$, and let $\{w_i\}$ be
an arbitrary sequence such that $\lim_{i \to \infty} w_i = w$. Let $\{w_j\}$ be the tail of
$\{w_i\}$ that is contained within $D$. Then, we have the following for all $w_j$:

$$A(w_j) = T_\eta(w_j). \qquad (2)$$

From Definition 5, for all $w_j$ it follows that

$$[T_\eta(w_j)]_n = \frac{1}{2}\, w_j(n) \times \left( \frac{1}{\eta \cdot w_j} \right)^{\eta(n)} \left( \frac{1}{1 - (\eta \cdot w_j)} \right)^{1 - \eta(n)}.$$

Because $\lim_{j \to \infty} w_j = w$, we have $\lim_{j \to \infty} w_j(n) = w(n)$. Similarly, we
have $\lim_{j \to \infty} \eta \cdot w_j = \eta \cdot w$. Furthermore, by the weak learning
assumption we have $0 < \eta \cdot w < \frac{1}{2}$. Combining these facts, we see that

$$\lim_{j \to \infty} T_\eta(w_j) = T_\eta(w).$$

Recalling Equation 2, we complete the proof.



In the next section, we will prove that the weight trajectories of AdaBoost are eventually
bounded away from type 2 discontinuities. Then we assume that AdaBoost satisfies the
condition that it is bounded away from type 1 discontinuities. This assumption will be
instrumental in our analysis, and gives us a way of proving the existence of an
invariant measure. The assumption is formalized below. Roughly speaking, this assumption
says that, after a sufficiently large number of rounds, either (1) the dichotomy
corresponding to the optimal weak hypothesis for a round is unique with respect to the
weights at that round, or (2) the dichotomies corresponding to the hypotheses that are tied
for optimal are essentially the same with respect to the weights in $\Omega_\infty$.

Assumption 12 (Optimal AdaBoost Eventually Has No Ties.) There exists a
compact set $G$ such that $\Omega_\infty \subseteq G$ and, given any pair $\eta, \eta' \in M$, we
have either

1. $\sigma(\eta) \cap \sigma(\eta') \cap G = \emptyset$; or

2. for all $w \in G$, $\sum_{i : \eta(i) \neq \eta'(i)} w(i) = 0$.

We conjecture that this assumption holds almost always in practice. 






Belanich and Ortiz 



Conjecture 13 (No Ties) Assumption 12 holds for Optimal AdaBoost, modulo minimal
conditions on the weak-hypothesis spaces, the process generating the training data, and,
indirectly, the dichotomies they induce.

Note that part (2) of Assumption 12 allows us to reduce the set of dichotomies to only
those that will never become effectively the same from the standpoint of Optimal AdaBoost
when dealing with $\Omega_\infty$.

We now declare two propositions that cover the conditions required to apply Theorem 9.
For now, we take these propositions for granted. They will be proved and discussed in later
sections.

Proposition 14 covers the first condition of Theorem 9, that $A$ is a measure-preserving
dynamical system for some measure $\mu$. Afterwards, Proposition 15 covers the second
condition for the secondary parameters of AdaBoost that will be essential in the impending
analysis. By combining the two propositions we are able to use Birkhoff's Ergodic Theorem
as a hammer to prove our main results.

Proposition 14 Suppose Assumption 12 holds. Then there exists a Borel probability measure
$\mu$ on $\Delta_m$ such that $(\Delta_m, \mathcal{B}, \mu, A)$ is a measure-preserving dynamical
system.

Proposition 15 Let $\mu$ be a Borel probability measure on $\Delta_m$. Then the following
functions are in $L^1(\mu)$:

1. $\epsilon(w) = \min_{\eta \in M} \eta \cdot w$

2. $\alpha(w) = \frac{1}{2} \log\left( \frac{1 - \epsilon(w)}{\epsilon(w)} \right)$

3. $\chi_{\sigma^*(\eta)}(w) = 1[w \in \sigma^*(\eta)]$

If we take the above propositions for granted, we can get a convergence result for
$\frac{1}{T} F_T(x_i)$ for $x_i$ in the training set. Note that in the below theorem we depart
from the standard notation of $F_T(x_i) = \sum_{t=1}^T \alpha_t h_t(x_i)$. The new notation
defines $F_T(x_i)$ in terms of the rows in the matrix $M$ constructed from the dichotomies of
$\mathcal{H}$. These dichotomies are defined over $\{0, 1\}$, unlike the hypotheses over
$\{-1, 1\}$, so we need to scale and translate them appropriately. The new notation for
$F_T(x_i)$ results in the exact same values as the one defined over hypotheses.

Theorem 16 Let $F_T(x_i) : S \to \mathbb{R}$, written
$F_T(x_i) = \sum_{t=1}^T 2\alpha_t(\eta_t(i) - 1)$, be the AdaBoost classifier function at round
$T$. For all $x_i \in S$, the limit $\lim_{T \to \infty} \frac{1}{T} F_T(x_i)$ exists.

Proof Let

$$C_\eta(T) = \sum_{t=1}^T \alpha(w_t)\, \chi_{\sigma^*(\eta)}(w_t).$$

From our assumptions it is easy to deduce that $\alpha \cdot \chi_{\sigma^*(\eta)}$ is measurable.
Furthermore, we have

$$\int_{w \in \Delta_m} \alpha(w)\, \chi_{\sigma^*(\eta)}(w) \, d\mu \leq \int_{w \in \Delta_m} \alpha(w) \, d\mu < \infty.$$

Whereby it follows that $\alpha \cdot \chi_{\sigma^*(\eta)}(w) \in L^1(\mu)$. Applying Birkhoff's
Ergodic Theorem, we see that $\lim_{T \to \infty} \frac{1}{T} C_\eta(T)$ exists for all $\eta$.
For convenience, set $C^*_\eta = \lim_{T \to \infty} \frac{1}{T} C_\eta(T)$.
We are restricting ourselves to the set $S \subset X$, hence we can write

$$\frac{1}{T} F_T(x_i) = \frac{1}{T} \sum_{\eta \in M} C_\eta(T)\, 2(\eta(i) - 1).$$

Finally, by taking limits we see that

$$\lim_{T \to \infty} \frac{1}{T} F_T(x_i) = \sum_{\eta \in M} \left[ \lim_{T \to \infty} \frac{1}{T} C_\eta(T) \right] 2(\eta(i) - 1) = \sum_{\eta \in M} C^*_\eta\, 2(\eta(i) - 1).$$



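As a corollary to this theorem, we can show convergence for the margin on any example
in the training set.

Corollary 17 Consider the function $\mathrm{margin}_T(x_i) : S \to \mathbb{R}$,

$$\mathrm{margin}_T(x_i) = F_T(x_i) \left( \sum_{t=1}^T \alpha_t \right)^{-1}.$$

For all $x_i \in S$, the limit $\lim_{T \to \infty} \mathrm{margin}_T(x_i)$ exists.

Proof By Birkhoff's Ergodic Theorem, $\delta = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \alpha(w_t)$
exists. From the weak learning assumption, we know that
$\epsilon_t = \epsilon(w_t) < \frac{1}{2} - \gamma$ for some $\gamma > 0$. This gives us a lower
bound $\alpha^* > 0$ on $\alpha(w_t)$. Using this lower bound, we see that

$$\delta = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \alpha(w_t) \geq \lim_{T \to \infty} \frac{1}{T} (\alpha^* T) \geq \alpha^* > 0.$$

Now, we can say that $\lim_{T \to \infty} T \left( \sum_{t=1}^T \alpha_t \right)^{-1} = \frac{1}{\delta}$.
Combining this with Theorem 16, we have for all $x_i \in S$

$$\lim_{T \to \infty} \mathrm{margin}_T(x_i) = \lim_{T \to \infty} F_T(x_i) \left( \sum_{t=1}^T \alpha_t \right)^{-1} = \lim_{T \to \infty} \frac{1}{T} F_T(x_i) \left( \frac{1}{T} \sum_{t=1}^T \alpha_t \right)^{-1} = \sum_{\eta \in M} 2\, \frac{C^*_\eta}{\delta}\, (\eta(i) - 1),$$

where $C^*_\eta / \delta$ is a probability distribution over the rows in $M$. ■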
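The quantities in the two proofs above can be estimated empirically. The following sketch simulates the weight dynamics on a toy error matrix and reports, for each row, the accumulated $\alpha$ mass divided by the total $\sum_t \alpha_t$. It assumes the usual AdaBoost reweighting and lowest-index tie-breaking, neither of which is prescribed by the text, and all names are hypothetical.

```python
import math


def normalized_row_mass(M, w, T):
    """Run T rounds from weights w; return each row's share of sum_t alpha_t.

    Assumes 0 < eps < 1/2 on every round (weak learning), lowest-index
    tie-breaking, and the standard AdaBoost reweighting.
    """
    C = [0.0] * len(M)
    for _ in range(T):
        errors = [sum(e * wi for e, wi in zip(eta, w)) for eta in M]
        eps = min(errors)
        k = errors.index(eps)                      # row selected this round
        C[k] += 0.5 * math.log((1 - eps) / eps)    # add alpha(w_t) to that row
        w = [wi / (2 * eps) if e else wi / (2 * (1 - eps))
             for e, wi in zip(M[k], w)]
    total = sum(C)
    return [c / total for c in C]


# Toy error matrix: each of three dichotomies errs on exactly one example.
M = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
mass = normalized_row_mass(M, [1 / 3, 1 / 3, 1 / 3], 3000)
print(mass)
```

On this small matrix the rows are selected in a fixed rotation, so the shares form a probability distribution that is close to uniform after a few thousand rounds.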

We run into some difficulties if we try to extend the above results to the whole instance
space $X$ instead of just the training set $S$. On $S$, we know that $2(\eta(i) - 1)$ will
correspond directly with some $h(x_i)$. However, outside of $S$ our $\eta$'s are no longer
defined, because they are simply 0-1 vectors over the examples in $S$. To evaluate $F_T(x)$ for
an arbitrary $x \in X$, we must appeal to the hypotheses that were selected from the hypothesis
space, not just the dichotomies they produced.

Let $H(\eta) = \{h \in \mathcal{H} \mid 2(\eta(i) - 1) = h(x_i) \text{ for all } x_i \in S\}$. A key
observation is that $H(\eta)$ induces an equivalence relation on $\mathcal{H}$:
$\mathcal{H} = \bigcup_{\eta \in M} H(\eta)$ and $H(\eta_1) \cap H(\eta_2) = \emptyset$ for any
pair of distinct $\eta_1, \eta_2 \in M$. All hypotheses in each equivalence class, from the
perspective of AdaBoost, are indistinguishable in the sense that picking any of them will result
in no change in the trajectory of $w_t$. However, the weak learner might have a bias towards
certain hypotheses in these classes. For example, perhaps the weak learner will always pick the
"simplest" hypothesis in $H(\eta)$, based on some simplicity measure (depth of the tree, number
of leaves, etc.).

Regardless, in many cases the weak learner will always pick a specific hypothesis in each
equivalence class. In this case, each class has a representative which we will denote $h^\eta$.
Then whenever $\eta = \eta^{w_t}$ we have $h_t = h^\eta$ selected by the weak learner. We also
get a finite number of hypothesis selection candidates in this case. If there are $n$ rows in $M$,
we are effectively reducing the hypothesis space from $\mathcal{H}$ to
$\{h^{\eta_1}, h^{\eta_2}, \ldots, h^{\eta_n}\}$.

This is a common selection scheme in the case of decision stumps. In that case, a
matrix very similar to $M$ is often constructed. Then a pruning procedure is employed
on $M$, removing repeated and dominated rows. The result is a scheme following the above
framework.
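A pruning step of this kind might look as follows. This is a sketch rather than a reference implementation: rows are 0-1 lists marking mistakes, a row is dropped when some other row's set of mistakes is a strict subset of its own, and the name `prune` is hypothetical.

```python
def prune(M):
    """Remove repeated rows, then rows dominated by some other row."""
    unique = []
    for row in M:
        if row not in unique:
            unique.append(row)
    # Set of example indices that each remaining row gets wrong.
    mistakes = [{j for j, e in enumerate(row) if e} for row in unique]
    # `a < b` on sets tests whether a is a strict subset of b.
    return [row for i, row in enumerate(unique)
            if not any(other < mistakes[i] for other in mistakes)]


M = [
    [1, 0, 0],
    [1, 1, 0],  # dominated: its mistakes strictly contain those of the first row
    [1, 0, 0],  # repeated
    [0, 0, 1],
]
pruned = prune(M)
print(pruned)
```

Rows that survive are pairwise incomparable in their mistake sets, so each can still be the strict minimizer for some weight vector.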

If we assume our weak learner follows the framework just described, we can extend our
results in Theorem 16 to the whole space $X$.

2. A dichotomy $\eta$ is dominated by $\eta'$ if the set of examples incorrectly classified by $\eta$ is a
strict superset of the examples incorrectly classified by $\eta'$, because the error
$\eta \cdot w_t > \eta' \cdot w_t$ for all $t$; thus $\eta$ will never be selected by Optimal AdaBoost.

Theorem 18 Suppose our weak learner follows the above framework. Then $\frac{1}{T} F_T(x)$
converges for all $x \in X$.

Proof Let $\eta_t = \eta^{w_t}$, then the hypothesis selected at iteration $t$ can be represented as
$h_t = h^{\eta_t}$. This yields

$$\frac{1}{T} F_T(x) = \frac{1}{T} \sum_{t=1}^T \alpha_t h_t(x) = \frac{1}{T} \sum_{t=1}^T \alpha_t h^{\eta_t}(x) = \frac{1}{T} \sum_{\eta \in M} C_\eta(T)\, h^\eta(x),$$

where $C_\eta$ is defined in the same way as in the proof of Theorem 16. As in the same
proof, we have $\lim_{T \to \infty} \frac{1}{T} C_\eta(T) = C^*_\eta$. Hence,

$$\lim_{T \to \infty} \frac{1}{T} F_T(x) = \sum_{\eta \in M} C^*_\eta\, h^\eta(x).$$

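Similarly, we can extend the convergence of the margin distribution to the whole space
$X$.

Corollary 19 Suppose our weak learner follows the above framework. Then $\mathrm{margin}_T(x)$
converges for all $x \in X$.

Proof We arrive at the convergence of
$\lim_{T \to \infty} C_\eta(T) \left( \sum_{t=1}^T \alpha_t \right)^{-1} = C^*_\eta / \delta$, where
$\delta = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \alpha_t$, in the same way
as in Corollary 17. Then, closely following the above proof, we get

$$\lim_{T \to \infty} \mathrm{margin}_T(x) = \lim_{T \to \infty} \left( \sum_{t=1}^T \alpha_t \right)^{-1} \sum_{t=1}^T \alpha_t h_t(x) = \sum_{\eta \in M} \frac{C^*_\eta}{\delta}\, h^\eta(x).$$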



Recall that the full AdaBoost classifier is

$$H_T(x) = \mathrm{sign}(F_T(x)).$$

In that equation, we can easily replace $\mathrm{sign}(F_T(x))$ with
$\mathrm{sign}(\frac{1}{T} F_T(x))$. From the convergence of $\frac{1}{T} F_T(x)$, we would like
to say that $H_T(x)$ converges as well. However, we have a discontinuity in the sign function at
0. It may be the case that $\lim_{T \to \infty} \frac{1}{T} F_T(x) = 0$ for some $x \in X$,
possibly yielding a non-existent limit for $\mathrm{sign}(\frac{1}{T} F_T(x))$. In that case,
$\lim_{T \to \infty} H_T(x)$ simply does not exist.

To overcome this obstacle, we will split the analysis. We will first consider the case that
$\lim_{T \to \infty} \frac{1}{T} F_T(x) \neq 0$ almost everywhere; then we consider the case
where this is not true.

For the first case, the limit of the classifier behaves nicely. This has the fascinating
implication that the AdaBoost classifier itself, $H_T$, is converging in classification for almost
all elements in the instance space $X$.

Theorem 20 Suppose our weak learner follows the above framework, and that
$\lim_{T \to \infty} \frac{1}{T} F_T(x) \neq 0$ for almost all $x \in X$. Then the limit
$H^*(x) = \lim_{T \to \infty} H_T(x)$ exists almost everywhere.

Proof Let $H^*(x) = \lim_{T \to \infty} H_T(x)$ and $F^*(x) = \lim_{T \to \infty} \frac{1}{T} F_T(x)$.
Then the following holds:

$$H^*(x) = \lim_{T \to \infty} H_T(x) = \lim_{T \to \infty} \mathrm{sign}\left( \tfrac{1}{T} F_T(x) \right) = \mathrm{sign}(F^*(x)) \quad \text{(almost everywhere)}.$$


If the AdaBoost classifier is converging in the limit, certainly its generalization error
should as well. The generalization error of any 0-1 classifier $H(x)$ is written

$$\mathrm{err}_{\mathcal{D}}(H) = \int_{(x,y) \in D} \left( \frac{1 - yH(x)}{2} \right) dP. \qquad (3)$$

It follows from the Lebesgue Dominated Convergence Theorem that the generalization
error converges.

Theorem 21 The limit of the generalization error, $\lim_{T \to \infty} \mathrm{err}_{\mathcal{D}}(H_T)$,
exists.

Proof First, we assume that each $H_T$ is measurable on the probability space $(D, \Sigma, P)$.
Otherwise it would not make sense to discuss generalization error in the first place, as it
would not exist for our classifier at any time $T$. It follows that $H^*$ is measurable, because
$\lim_{T \to \infty} H_T(x) = H^*(x)$ almost everywhere. Furthermore, $\frac{1 - yH_T(x)}{2}$ is
dominated by the constant function $f(x, y) = 1$ for all $(x, y) \in D$ and all $T$:

$$\left| \frac{1 - yH_T(x)}{2} \right| \leq 1.$$

Therefore, the conditions of the Lebesgue Dominated Convergence Theorem are satisfied.
We may then say that

$$\lim_{T \to \infty} \mathrm{err}_{\mathcal{D}}(H_T) = \lim_{T \to \infty} \int_{(x,y) \in D} \left( \frac{1 - yH_T(x)}{2} \right) dP = \int_{(x,y) \in D} \lim_{T \to \infty} \left( \frac{1 - yH_T(x)}{2} \right) dP = \int_{(x,y) \in D} \left( \frac{1 - yH^*(x)}{2} \right) dP.$$

3. If you let $F^*(x) = \lim_{T \to \infty} \frac{1}{T} F_T(x)$, intuitively we are saying that the decision
boundary of the function $F^*$ has measure 0 in the probability space $(D, \Sigma, P)$.



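We now consider the second case: the set of $x \in X$ such that
$\lim_{T \to \infty} \frac{1}{T} F_T(x) = 0$ has non-zero measure. Let

$$E = \left\{ (x, y) \in D \;\middle|\; \lim_{T \to \infty} \frac{1}{T} F_T(x) \neq 0 \right\}.$$

We let the complement of this set take non-zero measure. Assuming that this set is
measurable, we can say that the amount the generalization error can change in the limit
depends on $P(D - E)$.

Theorem 22 If $E$ is measurable, the following are true:

1. $\limsup_{T \to \infty} \mathrm{err}_{\mathcal{D}}(H_T) \leq \int_{(x,y) \in E} \left( \frac{1 - yH^*(x)}{2} \right) dP + P(D - E)$

2. $\liminf_{T \to \infty} \mathrm{err}_{\mathcal{D}}(H_T) \geq \int_{(x,y) \in E} \left( \frac{1 - yH^*(x)}{2} \right) dP - P(D - E)$

Additionally, if $\lim_{T \to \infty} \mathrm{err}_{\mathcal{D}}(H_T)$ exists, then

$$\left| \lim_{T \to \infty} \mathrm{err}_{\mathcal{D}}(H_T) - \int_{(x,y) \in E} \left( \frac{1 - yH^*(x)}{2} \right) dP \right| \leq P(D - E).$$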

Proof 

We can bound eTTDiHr) as follows: 



1 - yHrix) 



err d{Ht) = ( o 

^ / 

J(x 



dP 



'(x,y) 

Symmetrically, we also have 



{x,y)eE V 2 / J(x,y)eD~E 

< [ ('^^'^]dP + P{D-E). (4) 
J(x,y)eE \ 2 y 
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evvniHr) > [ fLjl^lM] dP - P{D - E). (5) 

J(x,y)<^E V ^ / 

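We will consider only Equation (4); the results for Equation (5) follow symmetrically.
By taking lim sup on both sides of Equation (4) we see that

$$\limsup_{T\to\infty} \mathrm{err}_D(H_T)
  \le \limsup_{T\to\infty} \int_{(x,y)\in E} \left( \frac{1 - yH_T(x)}{2} \right) dP + P(D - E)
  = \lim_{T\to\infty} \int_{(x,y)\in E} \left( \frac{1 - yH_T(x)}{2} \right) dP + P(D - E)
  = \int_{(x,y)\in E} \left( \frac{1 - yH^*(x)}{2} \right) dP + P(D - E).$$

Symmetrically, we find

$$\liminf_{T\to\infty} \mathrm{err}_D(H_T) \ge \int_{(x,y)\in E} \left( \frac{1 - yH^*(x)}{2} \right) dP - P(D - E).$$

Finally, if $\lim_{T\to\infty} \mathrm{err}_D(H_T)$ exists, we have

$$\liminf_{T\to\infty} \mathrm{err}_D(H_T) = \lim_{T\to\infty} \mathrm{err}_D(H_T) = \limsup_{T\to\infty} \mathrm{err}_D(H_T),$$

and the last claim follows from the two bounds above.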
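The bounds of Theorem 22 can be checked on a toy example. The sketch below is only an illustration under stated assumptions: the four-point probability space, the set `E`, and the classifier `H` are hypothetical and are not produced by AdaBoost. The classifier is stable on `E` and keeps flipping sign on the single point of $D - E$, so the error oscillates, yet it stays within $P(D - E)$ of the integral over $E$.

```python
from fractions import Fraction

# Hypothetical finite probability space: each point is (x, y, probability).
# Points 0..2 form E (the limit classifier is well defined there);
# point 3 is D - E, where the toy classifier never settles.
points = [(0, +1, Fraction(4, 10)),
          (1, -1, Fraction(3, 10)),
          (2, +1, Fraction(2, 10)),
          (3, +1, Fraction(1, 10))]
E = {0, 1, 2}

def H(T, x):
    """Toy classifier at round T (an assumption, not AdaBoost itself)."""
    if x == 3:
        return +1 if T % 2 == 0 else -1   # flips forever on D - E
    if x == 2:
        return -1                          # stable but wrong (y = +1)
    return +1 if x == 0 else -1            # stable and correct

def err(T):
    """Generalization error: integral of (1 - y*H_T(x)) / 2 with respect to P."""
    return sum(p * Fraction(1 - y * H(T, x), 2) for x, y, p in points)

# Integral over E of (1 - y*H*(x)) / 2; H is constant in T on E, so H(0, x) = H*(x) there.
limit_on_E = sum(p * Fraction(1 - y * H(0, x), 2) for x, y, p in points if x in E)
mass_outside = sum(p for x, _, p in points if x not in E)

errors = [err(T) for T in range(100, 120)]
print("min error:", min(errors), "max error:", max(errors))
print("integral over E:", limit_on_E, "P(D - E):", mass_outside)
```

In this toy case the limit of the error does not exist, but its lim sup and lim inf still sit inside the band of half-width $P(D - E)$ around the integral over $E$.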



To recap, we conjectured that the dynamics of AdaBoost drift away from the discontinuities
in its state space. Assuming this conjecture is true, we outlined a series of propositions
that establish that AdaBoost's weight update $\mathcal{A}$, along with the functions representing the
various secondary parameters, all satisfy the conditions for Birkhoff's Ergodic Theorem.
Using Theorem 9 as our main tool, we derived convergence results for various aspects of
AdaBoost, and showed that the margin of every example in the training set converges. Given that
the weak learner follows a certain framework, the convergence results on the training set
can be extended to the whole instance space $X$. Finally, if the decision boundary of $F^*$
has probability 0, we conclude that the AdaBoost classifier $H_T$, along with the generalization
error, converges. If the decision boundary has non-zero probability, we can say that
the amount the generalization error can change in the limit depends on the probability of
drawing an example on the decision boundary.



6. Characterizing the Inverse 

When studying the dynamics of the AdaBoost update $\mathcal{A}$, it is natural to ask: given
$w \in \Delta_m$, what is $\mathcal{A}^{-1}(w)$? Or similarly, given $E \subset \Delta_m$, what is $\mathcal{A}^{-1}(E)$? To approach
these questions, we decompose the inverse into a union of line segments.

First, let us simplify notation by decomposing a weight $w$ into the components that $\eta$
gets wrong and right.

Definition 23 Let $w^+$ and $w^-$ be defined component-wise as

1. $w^+(i) = w(i)\,(1 - \eta(i))$

2. $w^-(i) = w(i)\,\eta(i)$.

The inverse set of $T_\eta$ at a point $w$ is a line segment radiating outwards from $w$.

Proposition 24 Suppose we have $\eta \in M$ and $w \in \Omega$. Then
$T_\eta^{-1}(w) = \{ 2\rho w^- + 2(1 - \rho) w^+ \mid \rho \in [0,1] \}$.

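Proof Let $L(\eta, w) = \{ 2\rho w^- + 2(1 - \rho) w^+ \mid \rho \in [0,1] \}$. Consider an element $w' \in L(\eta, w)$.
Clearly, $w' = 2\rho' w^- + 2(1 - \rho') w^+$ for some $\rho' \in [0,1]$. Then,

$$\mathrm{err}(\eta, w') = \mathrm{err}(\eta, 2\rho' w^-) + \mathrm{err}(\eta, 2(1 - \rho') w^+) = 2\rho' (\eta \cdot w) = \rho'.$$

Using this fact, we see for $i$ such that $\eta(i) = 1$

$$[T_\eta(w')](i) = w'_i \times \frac{1}{2(\eta \cdot w')} = 2\rho' w_i \times \frac{1}{2\rho'} = w_i.$$

And similarly, for $i$ such that $\eta(i) = 0$

$$[T_\eta(w')](i) = w'_i \times \frac{1}{2(1 - (\eta \cdot w'))} = 2(1 - \rho') w_i \times \frac{1}{2(1 - \rho')} = w_i.$$

Pulling the cases together, we conclude that $T_\eta(w') = w$ and $w' \in T_\eta^{-1}(w)$.
Instead, suppose that $w' \in T_\eta^{-1}(w)$. From the definition of $T_\eta$ we see that

$$w'_i = \begin{cases} 2(\eta \cdot w')\, w_i & \text{if } \eta(i) = 1 \\ 2(1 - (\eta \cdot w'))\, w_i & \text{if } \eta(i) = 0. \end{cases}$$

Setting $\rho' = \eta \cdot w'$, we see that $w' \in L(\eta, w)$.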
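The segment description of $T_\eta^{-1}(w)$ is easy to check numerically. The sketch below is a minimal illustration, not the authors' code: the helper names `err`, `T`, and `segment_point` are hypothetical, the update `T` is written from the component-wise formula used in the proof, and the example weight is chosen so that $\eta \cdot w = 1/2$.

```python
def err(eta, w):
    """Weighted error of mistake vector eta: total weight where eta(i) = 1."""
    return sum(wi for m, wi in zip(eta, w) if m == 1)

def T(eta, w):
    """Hypothetical update T_eta: rescale so mistakes and non-mistakes each carry mass 1/2."""
    e = err(eta, w)
    return [wi / (2 * e) if m == 1 else wi / (2 * (1 - e))
            for m, wi in zip(eta, w)]

def segment_point(eta, w, rho):
    """The point 2*rho*w^- + 2*(1 - rho)*w^+ on the candidate inverse segment."""
    return [2 * rho * wi if m == 1 else 2 * (1 - rho) * wi
            for m, wi in zip(eta, w)]

eta = [1, 0, 0, 1]            # eta is wrong on examples 0 and 3
w = [0.2, 0.1, 0.4, 0.3]      # chosen so that eta . w = 0.5

for rho in (0.1, 0.25, 0.5, 0.9):
    w_prime = segment_point(eta, w, rho)
    # Each w_prime is a distribution with error rho, and T maps it back to w.
    print(rho, round(sum(w_prime), 12), round(err(eta, w_prime), 12),
          [round(v, 12) for v in T(eta, w_prime)])
```

The endpoints $\rho = 0$ and $\rho = 1$ are left out of the loop because the update divides by $2\rho$ and $2(1 - \rho)$.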
So, $T_\eta(w)$ has a very clean inverse, being simply a line segment through the simplex. But it is
important to note that $T_\eta(w)$ is hypothetical, asking "where would $w$ go if $\eta = \eta_w$?", and is
not the true AdaBoost weight update, $\mathcal{A}(w)$. However, the inverse $\mathcal{A}^{-1}(w)$ does decompose
into a union of these line segments.

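Proposition 25 Let $w \in \Delta_m$. Then $\mathcal{A}^{-1}(w) = \bigcup_{\eta \in M} \left( T_\eta^{-1}(w) \cap \sigma^*(\eta) \right)$.

Proof Take $w' \in \mathcal{A}^{-1}(w)$. Then $w' \in \sigma^*(\eta_{w'})$ and $w' \in T_{\eta_{w'}}^{-1}(w)$. Therefore,
$w' \in \bigcup_{\eta \in M} \left( T_\eta^{-1}(w) \cap \sigma^*(\eta) \right)$.

Instead, take $w' \in \bigcup_{\eta \in M} \left( T_\eta^{-1}(w) \cap \sigma^*(\eta) \right)$. It must be the case that
$w' \in T_{\eta_{w'}}^{-1}(w) \cap \sigma^*(\eta_{w'})$, because $w'$ can only be in $\sigma^*(\eta')$ for one possible $\eta' \in M$, namely $\eta' = \eta_{w'}$. But by
the definition of $\mathcal{A}$, this implies $\mathcal{A}(w') = w$. Therefore, $w' \in \mathcal{A}^{-1}(w)$.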
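The decomposition in Proposition 25 suggests a direct way to sample the preimage of a weight: walk each segment $T_\eta^{-1}(w)$ and keep the points at which $\eta$ is the unique best row. The sketch below is a minimal illustration under assumptions: the three-row mistake matrix `M`, the grid over $\rho$, and all helper names are hypothetical, and `ada_select` stands in for the paper's AdaSelect with ties broken by row order.

```python
def err(eta, w):
    """Weighted error of mistake vector eta under weights w."""
    return sum(wi for m, wi in zip(eta, w) if m == 1)

def T(eta, w):
    """Update that puts mass 1/2 on the mistakes of eta and 1/2 on the rest."""
    e = err(eta, w)
    return [wi / (2 * e) if m == 1 else wi / (2 * (1 - e))
            for m, wi in zip(eta, w)]

def ada_select(M, w):
    """Row of M with the smallest weighted error (first row wins ties; an assumption)."""
    return min(M, key=lambda eta: err(eta, w))

def A(M, w):
    """Optimal AdaBoost weight update: apply T for the selected row."""
    return T(ada_select(M, w), w)

def sample_preimage(M, w, steps=200):
    """Grid-sample the union over eta of (segment T_eta^{-1}(w)) intersected with sigma*(eta)."""
    found = []
    for eta in M:
        if abs(err(eta, w) - 0.5) > 1e-12:
            continue  # the segment is empty unless eta . w = 1/2
        for k in range(1, steps):
            rho = k / steps
            cand = [2 * rho * wi if m == 1 else 2 * (1 - rho) * wi
                    for m, wi in zip(eta, w)]
            # Keep cand only if eta is the strict minimizer there.
            if all(err(eta, cand) < err(other, cand) for other in M if other != eta):
                found.append((rho, cand))
    return found

M = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]   # each toy hypothesis errs on one example
w = [0.5, 0.3, 0.2]                     # only the first row has error 1/2 on w
preimage = sample_preimage(M, w)
print(len(preimage), "sampled preimage points; largest rho:",
      max(rho for rho, _ in preimage))
```

For this toy matrix only the first row contributes a segment, and the first row remains the strict minimizer exactly while $\rho < 0.4\,(1 - \rho)$, that is, for $\rho < 2/7$.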



7. Satisfying Birkhoff's Ergodic Theorem



This section is devoted to proving Propositions 14 and 15, and to do so we will use the
fundamental Krylov-Bogolyubov theorem, listed as Theorem 26 below. For a given dynamical system that
meets certain conditions, Krylov-Bogolyubov tells us that the system preserves
some Borel probability measure. We will apply this theorem on our trapped attracting
set $\Omega_\infty$ to show that $\mathcal{A}$ admits an invariant measure.



Recall that part (2) of Assumption 12 says that if $\sigma(\eta) \cap \sigma(\eta') \cap G \neq \emptyset$, then $\eta = \eta'$
with respect to the weights in $G$. Let $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_k$ be the equivalence classes over the
rows of $M$ where, for every pair $\eta, \eta' \in \mathcal{H}_j$, we have $\sum_{i : \eta(i) \neq \eta'(i)} w(i) = 0$ for all $w \in G$.
We construct a new matrix $M'$ where row $j$ is AdaSelect($\mathcal{H}_j$). On $G$, it is easy to see that
the dynamics of $\mathcal{A}$ with respect to $M$ are the same as with respect to $M'$. Meanwhile, $\mathcal{A}$ with
respect to $M'$ is bounded away from ties. Whenever part (2) of Assumption 12 holds, we
will restrict our analysis to $\mathcal{A}$ with respect to $M'$, but for simplicity keep the notation $M$.

A couple of concepts are essential in understanding this theorem. First, Krylov-Bogolyubov
requires that we deal with a system of the form $(X, \mathcal{T})$, where $X$ is the state space and $\mathcal{T}$
a topology on it. Furthermore, $(X, \mathcal{T})$ needs to be metrizable, meaning the topology $\mathcal{T}$ can
be induced by some metric. To simplify matters, we will treat $\Delta_m$ as a metric space with
the metric

$$d(w_1, w_2) = \sum_i |w_1(i) - w_2(i)|.$$

We will not use $d$ directly in our proofs; but when we discuss convergence of sequences
of weights in $\Delta_m$, we will implicitly use the metric (see footnote 4).

4. Convergence is meaningless without such a metric. The definitions of closed and open sets also implicitly
use $d$: closed sets are the sets in $\Delta_m$ that contain all of their limit points. That is, a set $E$ is closed if,
given any convergent sequence $\{w_i\} \subset E$, we have $\lim_{i\to\infty} w_i \in E$. Compact sets are closed sets that
are bounded. We are only considering subsets of $\Delta_m$, so all such subsets are bounded and any closed
subset will be compact.
Krylov-Bogolyubov requires that the state space is compact. We want to apply the
theorem on $\Omega_\infty$, so this is the set we must scrutinize. As mentioned before, this set is
contained within $\Delta_m$, so we know it is bounded. Hence, we only need to show that $\Omega_\infty$ is
closed. That is the motivation behind Theorem 28.

Additionally, Krylov-Bogolyubov requires that $\mathcal{A}$ is continuous on $\Omega_\infty$. We have stated
in Assumption 12 that $\Omega_\infty$ is bounded away from type 1 discontinuities, so what remains to
show is that it does not contain type 2 discontinuities. This is accomplished in Theorem 30.



Theorem 26 (Krylov-Bogolyubov) Let $(X, \mathcal{T})$ be a compact, metrizable topological space
and $F : X \to X$ a continuous map. Then $F$ admits an invariant Borel probability measure.
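The standard proof of this theorem obtains the invariant measure as a limit point of empirical averages along an orbit, which is also why time averages are the natural objects to study. The sketch below illustrates the idea on a toy system and is not the AdaBoost update: the map `rotate` (an irrational rotation of the circle, continuous on a compact space), the test function `f`, and the starting point are all assumptions chosen so that the answer is known in advance.

```python
import math

def rotate(x, alpha=math.sqrt(2) - 1):
    """Toy continuous map on the circle [0, 1) with endpoints identified."""
    return (x + alpha) % 1.0

def time_average(f, x0, n):
    """Average of f along the first n points of the orbit of x0."""
    total, x = 0.0, x0
    for _ in range(n):
        total += f(x)
        x = rotate(x)
    return total / n

def f(x):
    """Test function whose integral against Lebesgue measure on [0, 1) is 1/2."""
    return math.cos(2 * math.pi * x) ** 2

avg = time_average(f, 0.123, 100_000)
print("time average:", avg)
```

For this rotation, Lebesgue measure is invariant, and the time average of `f` approaches its space average of 1/2. For AdaBoost the invariant measure is not known in closed form, which is why the existence argument in this section is needed.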



We will begin by showing the compactness of $\Omega_\infty$, given Assumption 12. We first
approach this by proving the following lemma, which states that any limit point $w$ of $\Omega_\infty$
has a corresponding limit point $w' \in G$ such that $\mathcal{A}(w') = w$.

Lemma 27 Suppose Assumption 12 holds. Let $\{w_i\}$ be an arbitrary convergent sequence
in $\Omega_\infty$, and call its limit $w$. Then there exists a second convergent sequence $\{w'_i\} \subset \Omega_\infty$ such
that $\mathcal{A}(\lim_{i\to\infty} w'_i) = w$.



Proof Let $\{w_i\} \subset \Omega_\infty$ be a sequence as described in the hypothesis, and let $w =
\lim_{i\to\infty} w_i$. From the compactness of $G$, we have $w \in G$. Additionally, as $w_i \in \Omega_\infty$, there
must exist a $w'_i \in \Omega_\infty$ such that

$$w_i = \mathcal{A}(w'_i). \quad (6)$$

Let $\{w'_i\} \subset \Omega_\infty$ be a sequence composed of such elements. We will now proceed to show
that there exists a subsequence of $\{w'_i\}$ that has a limit $w' \in \mathcal{A}^{-1}(w)$.
Consider subsets of $G$ of the form

$$P^*(\eta) = \{ g \in G \mid g \in \sigma^*(\eta) \}.$$

Note that $G$ does not contain any elements that are tied, hence we have

$$G = \bigcup_{\eta \in M} P^*(\eta).$$

There exists an $\eta \in M$ such that $P^*(\eta)$ contains infinitely many elements of the sequence
$\{w'_i\}$. Let $\{w'_{i_j}\}$ be the subsequence of $\{w'_i\}$ that is contained in $P^*(\eta)$. Note that $G$ is
sequentially compact because $G$ is a compact subset of a metric space. Therefore, $\{w'_{i_j}\}$ has
a convergent subsequence $\{w'_{i_k}\}$; call its limit $w'$. We claim that $w' \in P^*(\eta)$.

Let $P(\eta) = \{ g \in G \mid g \in \sigma(\eta) \}$, which is simply the intersection $G \cap \sigma(\eta)$ and the closure
of $P^*(\eta)$. It follows that $P(\eta)$ is closed, because both sets involved in the intersection are
closed. Also note that $P^*(\eta) \subset P(\eta)$. The sequence $\{w'_{i_k}\}$ is therefore contained in $P(\eta)$,
yielding $w' \in P(\eta)$. Now, either $w' \in P^*(\eta)$ or $w' \in P(\eta) - P^*(\eta)$, the latter containing only
weights in which $\eta$ is tied with another row in $M$. The second case is impossible because
the hypothesis of the lemma does not allow ties, so we must conclude $w' \in P^*(\eta)$.

Now we proceed to show that $\mathcal{A}(w') = w$. From Equation (6) it is clear that there is a
subsequence $\{w_{i_k}\}$ of $\{w_i\}$ such that $w_{i_k} = \mathcal{A}(w'_{i_k})$. Whereby,

$$\lim_{k\to\infty} \mathcal{A}(w'_{i_k}) = \lim_{k\to\infty} w_{i_k} = \lim_{i\to\infty} w_i = w. \quad (7)$$

By continuity of $\mathcal{A}$ on $G$ (see Theorem 11), it is clear that

$$\lim_{k\to\infty} \mathcal{A}(w'_{i_k}) = \mathcal{A}\left( \lim_{k\to\infty} w'_{i_k} \right) = \mathcal{A}(w'). \quad (8)$$

Then, combining Equation (7) and Equation (8), we conclude that $\mathcal{A}(w') = w$.



Given any limit point $w$ of $\Omega_\infty$, this lemma lets us construct an infinite orbit backwards
from $w$ contained entirely in $G$, whereby $w \in \Omega_\infty$, giving us compactness. This is formalized
in the next theorem.

Theorem 28 Suppose Assumption 12 holds. Then $\Omega_\infty$ is compact.

Proof Let $\{w_i\}$ be an arbitrary convergent sequence contained in $\Omega_\infty$, and let $w =
\lim_{i\to\infty} w_i$. By Lemma 27 there exists a sequence $\{w^{(1)}_i\} \subset \Omega_\infty$ converging to $w^{(1)} \in G$ such
that $\mathcal{A}(w^{(1)}) = w$. However, notice that $\{w^{(1)}_i\}$ also satisfies the hypothesis of Lemma 27.
Applying the lemma to $\{w^{(1)}_i\}$, we get $\{w^{(2)}_i\} \subset \Omega_\infty$ converging to $w^{(2)} \in G$ such that
$\mathcal{A}(w^{(2)}) = w^{(1)}$, and therefore $\mathcal{A}^{(2)}(w^{(2)}) = w$. We can continue in this way to generate $w^{(n)} \in G$
such that $\mathcal{A}^{(n)}(w^{(n)}) = w$ for any $n$. Therefore, $w \in \mathcal{A}^{(n)}(\Delta_m)$ for all $n \in \mathbb{N}$ and we must
conclude that $w \in \Omega_\infty$. Because $w$ was the limit of an arbitrary convergent sequence in
$\Omega_\infty$, the set $\Omega_\infty$ is closed; since it is also bounded, it is compact. ■

Now we must turn to understanding the continuity properties of $\mathcal{A}$. As previously
mentioned, $\mathcal{A}$ is continuous on most points in its state space. Assumption 12 tells us that
$\Omega_\infty$ is bounded away from type 1 discontinuities. But for $\mathcal{A}$ to be continuous on our attracting
set $\Omega_\infty$, we must also show that $\Omega_\infty$ contains no points that have a row with error zero.

The following lemma takes a step towards this goal. It shows that, given a point $w \in
\Omega(\eta)$, if the error of a hypothesis in $M$ is low on $w$, then the error of that same hypothesis on
the inverse of $w$ is not too large. Not only that, but the error of the row selected at the inverse also is not
too large. This lemma is used in proving the next theorem, which tells us that AdaBoost
is bounded away from type 2 discontinuities.

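Lemma 29 Let $\eta, \eta' \in M$ and $w \in \Omega(\eta)$. If $\eta' \cdot w < \epsilon_0$, then for all $w' \in \mathcal{A}^{-1}(w)$ we have
$\eta' \cdot w' < 2\epsilon_0$ and $\eta_{w'} \cdot w' < 2\epsilon_0$.

Proof Let $M_{1/2}(w) = \{ \eta \in M \mid \eta \cdot w = \tfrac{1}{2} \}$ and let $w(\rho, \eta) = 2\rho w^- + 2(1 - \rho) w^+$. Let
$L = \{ w(\rho, \eta) \mid \eta \in M_{1/2}(w),\ \rho \in [0, \tfrac{1}{2}) \}$, and note that $\mathcal{A}^{-1}(w) \subseteq L$, so it suffices to show that
the lemma holds for all elements in $L$.

Take an arbitrary pair $\eta \in M_{1/2}(w)$, $\eta' \in M$, and $\rho \in [0, \tfrac{1}{2})$. We can decompose $\eta' \cdot w(\rho, \eta)$
as

$$\eta' \cdot w(\rho, \eta) = 2\rho\,(\eta' \cdot w^- - \eta' \cdot w^+) + 2(\eta' \cdot w^+). \quad (9)$$

To upper bound $\eta' \cdot w(\rho, \eta)$, we consider two cases depending on the relationship
between $\eta' \cdot w^-$ and $\eta' \cdot w^+$.

1. If $\eta' \cdot w^- \ge \eta' \cdot w^+$, then, since $2\rho < 1$,

$$\eta' \cdot w(\rho, \eta) \le \eta' \cdot w^- + \eta' \cdot w^+ = \eta' \cdot w < \epsilon_0.$$

2. If $\eta' \cdot w^- < \eta' \cdot w^+$, then

$$\eta' \cdot w(\rho, \eta) \le 2\,\eta' \cdot w^+ \le 2(\eta' \cdot w) < 2\epsilon_0.$$

Taking the largest of the upper bounds, we conclude that $\eta' \cdot w(\rho, \eta) < 2\epsilon_0$. Now,
if $w(\rho, \eta) \in \mathcal{A}^{-1}(w)$, let $\eta_{w(\rho,\eta)} = \mathrm{AdaSelect}(w(\rho, \eta))$ be the row selected there. Because the
selected row has minimum error, $\eta_{w(\rho,\eta)} \cdot w(\rho, \eta) \le
\eta' \cdot w(\rho, \eta) < 2\epsilon_0$. ■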
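The decomposition in Equation (9) and the resulting factor-of-two bound can be spot-checked on random inputs. The sketch below is a numerical sanity check under assumptions, not part of the proof: the dimension, the random generator, and the helper names `dot`, `split`, and `random_case` are all hypothetical, and each sampled weight is normalized so that $\eta \cdot w = 1/2$.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def split(eta, w):
    """Return (w_plus, w_minus): the parts of w that eta gets right and wrong."""
    w_plus = [wi * (1 - m) for m, wi in zip(eta, w)]
    w_minus = [wi * m for m, wi in zip(eta, w)]
    return w_plus, w_minus

def random_case(m, rng):
    """Random eta, eta', a weight w with eta . w = 1/2, and rho in [0, 1/2)."""
    while True:
        eta = [rng.randint(0, 1) for _ in range(m)]
        if 0 < sum(eta) < m:          # eta needs both mistakes and non-mistakes
            break
    raw = [rng.random() + 1e-3 for _ in range(m)]
    mass_minus = dot(eta, raw)
    mass_plus = sum(raw) - mass_minus
    w = [0.5 * r / mass_minus if e == 1 else 0.5 * r / mass_plus
         for e, r in zip(eta, raw)]
    eta_prime = [rng.randint(0, 1) for _ in range(m)]
    return eta, eta_prime, w, 0.5 * rng.random()

rng = random.Random(0)
max_identity_gap = 0.0   # largest violation of Equation (9)
max_ratio = 0.0          # largest observed (eta' . w(rho, eta)) / (eta' . w)
for _ in range(2000):
    eta, eta_prime, w, rho = random_case(6, rng)
    w_plus, w_minus = split(eta, w)
    w_rho = [2 * rho * a + 2 * (1 - rho) * b for a, b in zip(w_minus, w_plus)]
    lhs = dot(eta_prime, w_rho)
    rhs = (2 * rho * (dot(eta_prime, w_minus) - dot(eta_prime, w_plus))
           + 2 * dot(eta_prime, w_plus))
    max_identity_gap = max(max_identity_gap, abs(lhs - rhs))
    base = dot(eta_prime, w)
    if base > 0:
        max_ratio = max(max_ratio, lhs / base)

print("largest gap in Equation (9):", max_identity_gap)
print("largest error ratio:", max_ratio)
```

The ratio never exceeds 2, matching the lemma: moving one step backwards along an inverse segment can at most double the error of any fixed row.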

We can now apply the above lemma in a recursive manner to show that there exists
an $n_0$ such that for all $n > n_0$ the error of any hypothesis on the points in $\mathcal{A}^{(n)}(\Delta_m)$ is
bounded away from zero.

Theorem 30 There exist an $\epsilon_* > 0$ and $n_0$ such that, for all $n > n_0$, we have $\eta \cdot w \ge \epsilon_*$
for all $w \in \mathcal{A}^{(n)}(\Delta_m)$ and $\eta \in M$.

Proof Set $n_0 = |M| + 1$ and $\epsilon_* < 1/2^{n_0+1}$. Take an arbitrary $\eta \in M$ and $w \in \Delta_m$ such that
$\eta \cdot w < \epsilon_*$. We will show that such a $w$ is not contained within $\mathcal{A}^{(n)}(\Delta_m)$ for $n > n_0$.

Let $w^{(1)} \in \mathcal{A}^{-1}(w)$. If no such $w^{(1)}$ exists, we have already demonstrated our goal.
Otherwise, let $\eta^{(1)} = \eta_{w^{(1)}}$. By Lemma 29 we know that $\eta^{(1)} \cdot w^{(1)} < 2\epsilon_*$. Continuing in this
way, let $w^{(2)} \in \mathcal{A}^{-1}(w^{(1)})$, which we can assume exists by the same argument made for $w^{(1)}$.
Let $\eta^{(2)} = \eta_{w^{(2)}}$, and note that $\eta^{(2)} \neq \eta^{(i)}$ for $i < 2$ because

$$\eta^{(2)} \cdot w^{(1)} = \tfrac{1}{2} > 2\epsilon_* > \eta^{(1)} \cdot w^{(1)}.$$

We can continue this template out to $n_0$. Let $w^{(n_0)} \in \mathcal{A}^{-1}(w^{(n_0 - 1)})$. We claim that such
a $w^{(n_0)}$ cannot exist. For sake of contradiction, suppose it did. Then, let $\eta^{(n_0)} = \eta_{w^{(n_0)}}$.
From Lemma 29, we know that $\eta^{(n_0)} \neq \eta^{(i)}$ for all $i < n_0$ because

$$\eta^{(n_0)} \cdot w^{(n_0 - 1)} = \tfrac{1}{2} > \eta^{(i)} \cdot w^{(n_0 - 1)}.$$

Furthermore, all $\eta^{(i)}$ are unique by the construction of this sequence. Because $n_0 =
|M| + 1$, the sequence $\{\eta^{(1)}, \eta^{(2)}, \ldots, \eta^{(n_0 - 1)}\} = M$. But because $\eta^{(n_0)} \neq \eta^{(i)}$ for all $i < n_0$,
$\eta^{(n_0)}$ is not in $M$. As this is a contradiction, we must conclude that no such $w^{(n_0)}$ exists,
and that $\mathcal{A}^{-1}(w^{(n_0 - 1)}) = \emptyset$. Our selection of $w^{(i)}$ was arbitrary in each step, so we can also
conclude that there does not exist any $w'$ such that $\mathcal{A}^{(n_0)}(w') = w$, or else it would have been
reached by the above procedure. Finally, this shows that $w$ is not contained in $\mathcal{A}^{(n_0)}(\Delta_m)$. ■



We now have the tools necessary to prove Proposition 14, originally stated in Section 5.
Proof (Proof of Proposition 14) If the condition holds, then $\Omega_\infty$ is a compact and metrizable
topological space. Also, $\mathcal{A}$ is continuous on $\Omega_\infty$. It follows from Krylov-Bogolyubov
that there exists a Borel probability measure $\mu$ that is invariant under $\mathcal{A}$. We may extend this
measure to all of $\Delta_m$: for $A \subset \Delta_m$, define $\hat{\mu}(A) = \mu(A \cap \Omega_\infty)$. ■



Also, we now have the tools necessary to prove Proposition 15, stated in the same
section. We prove a slightly modified version tailored to our new context.

Proof (Proof of Proposition 15)

1. Because $\epsilon(w)$ is the minimum of a finite set of continuous functions, it follows that $\epsilon(w)$
is continuous as well. In the case of a Borel algebra, continuity implies measurability.
It follows that

$$\int |\epsilon(w)|\, d\mu < \int \tfrac{1}{2}\, d\mu \le \tfrac{1}{2}.$$

Therefore, $\epsilon(w) \in L^1(\mu)$.

2. Because $\epsilon(w)$ is continuous and bounded away from 0 on $\Omega_\infty$, it follows that $\alpha(w)$ is
continuous as well. As above, this implies measurability. From $\epsilon(w) \ge \epsilon_* > 0$, we
have an upper bound on $\alpha(w)$ that we will call $\alpha^*$. Therefore, we then have

$$\int |\alpha(w)|\, d\mu \le \int \alpha^*\, d\mu = \alpha^* \mu(\Omega_\infty) = \alpha^*.$$

Whereby $\alpha(w) \in L^1(\mu)$.

3. $\chi_{\sigma^*(\eta)}(w)$ is measurable and bounded above by 1. Therefore, it is in $L^1(\mu)$.
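The upper bound $\alpha^*$ in part 2 can be written out explicitly. The display below is a sketch that assumes the standard AdaBoost step size $\alpha(w) = \tfrac{1}{2}\ln\frac{1 - \epsilon(w)}{\epsilon(w)}$, which this section does not restate; under that assumption $\alpha$ is decreasing in $\epsilon$, so the lower bound on the error from Theorem 30 translates directly into an upper bound on the step size.

```latex
\[
  \alpha(w) = \frac{1}{2}\ln\frac{1 - \epsilon(w)}{\epsilon(w)},
  \qquad
  \epsilon(w) \ge \epsilon_* > 0
  \;\Longrightarrow\;
  \alpha(w) \le \frac{1}{2}\ln\frac{1 - \epsilon_*}{\epsilon_*} =: \alpha^* < \infty .
\]
```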
8. Preliminary Experimental Results Support the No-Ties Conjecture

This section discusses preliminary experimental evidence that Assumption 12 holds in practice.
Recall that the assumption requires that, for any pair $\eta, \eta' \in M$, either (1) there are
no ties between $\eta$ and $\eta'$ in the limit, or (2) if they are tied, they are effectively the same
with respect to the weights in the limit. We provide empirical evidence on a few commonly
used data sets suggesting that these two conditions are satisfied in practice.



In Figure 3 we run AdaBoost with decision stumps on the Heart-Disease (Frank and Asuncion,
2010b), Sonar (Frank and Asuncion, 2010a), and Breast-Cancer (Frank and Asuncion,
2010a) datasets, while tracking the difference between the error of the best row and the
second best row of $M$ at each round $t$. Let $\eta_t$ be the optimal row in $M$ at round $t$. When
looking for the second best row at round $t$, we ignore rows $\eta'$ such that $\sum_{i : \eta_t(i) \neq \eta'(i)} w_t(i) < \epsilon$,
where $\epsilon$ is a small dataset-specific threshold ($\epsilon = 10^{-10}$ for Breast-Cancer). We
start AdaBoost from an initial weight drawn uniformly at random from the $m$-simplex (as
opposed to the traditionally used weight corresponding to a uniform distribution over the
training examples).
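The tracking procedure described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the code behind Figure 3: the toy mistake matrix `M` (with a deliberately duplicated row), the threshold, the number of rounds, and the helper names are hypothetical, and the uniform draw from the simplex uses normalized exponential variates.

```python
import random

def err(eta, w):
    """Weighted error of mistake vector eta under weights w."""
    return sum(wi for m, wi in zip(eta, w) if m == 1)

def update(eta, w):
    """AdaBoost reweighting for the selected row eta."""
    e = err(eta, w)
    return [wi / (2 * e) if m == 1 else wi / (2 * (1 - e))
            for m, wi in zip(eta, w)]

def gap_to_second_best(M, w, eps):
    """Error gap between the best row and the best non-equivalent rival.

    A rival is ignored when the weight on the examples where it differs
    from the best row is below eps (such rows are treated as equivalent).
    """
    best = min(M, key=lambda r: err(r, w))
    rivals = [r for r in M
              if sum(wi for a, b, wi in zip(best, r, w) if a != b) >= eps]
    if not rivals:
        return None
    return min(err(r, w) for r in rivals) - err(best, w)

def random_simplex_point(m, rng):
    """Uniform draw from the simplex via normalized exponential variates."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]

# Toy matrix: each hypothesis errs on one example; the last row duplicates the first.
M = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (1, 0, 0, 0)]
rng = random.Random(1)
w = random_simplex_point(4, rng)
gaps = []
for t in range(50):
    gaps.append(gap_to_second_best(M, w, eps=1e-10))
    w = update(min(M, key=lambda r: err(r, w)), w)
print(min(gaps), max(gaps))
```

Without the equivalence filter, the duplicated row would report a gap of exactly zero whenever the first row is selected; with the filter, the gap measures the distance to the nearest genuinely different rival.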

The difference between best and second best row tends to decrease to $\epsilon$ early on. This
happens because the weights of examples that do not attain the minimal margin go to zero.
From now on we refer to the minimal-margin examples as the "support vectors" because of their
similar interpretation to those examples in SVMs (see footnote 5). Such zero-weight examples
could cause certain rows to become essentially equal with respect to the weights. Once such
weights go below $\epsilon$, a condition which
we equate to essentially satisfying part (2) of Assumption 12, we ignore these "equivalent"
rows. In turn, this causes the trajectory of the differences between best and second best to
jump upwards. After a sufficient number of rounds, the set of support vectors manifests,
and this jumping behavior stops. At this point, it appears the distance from ties is bounded
away from zero, suggesting Assumption 12 holds when boosting decision stumps on these
datasets.

Figure 4 provides reasonably clear evidence for the convergence of the Optimal AdaBoost
classifier when boosting decision stumps on the Breast-Cancer dataset. In this figure,
the margin for every example seems to be converging: from rounds 90K to 100K there is very
little change, as seen most clearly in plot (b) of that figure. Figure 5 shows convergence
of the minimum margin; this is essentially a more complete view of the convergence of the
minimum margin clearly seen in the histograms in Figure 4(c). This converging behavior
is as predicted by the theoretical work in Section 5.
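The quantities plotted in Figures 4 and 5 can be computed directly from the sequence of selected rows. The sketch below is an illustration under assumptions rather than the experiment itself: the four-row toy mistake matrix and initial weight are hypothetical, the step size is assumed to be the standard $\alpha_t = \tfrac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$, and the signed margin of example $i$ is taken as $\sum_t \alpha_t (1 - 2\eta_t(i)) / \sum_t \alpha_t$, since a hypothesis contributes $+1$ where it is right and $-1$ where it is wrong.

```python
import math

def err(eta, w):
    """Weighted error of mistake vector eta under weights w."""
    return sum(wi for m, wi in zip(eta, w) if m == 1)

def update(eta, w):
    """AdaBoost reweighting for the selected row eta."""
    e = err(eta, w)
    return [wi / (2 * e) if m == 1 else wi / (2 * (1 - e))
            for m, wi in zip(eta, w)]

def min_margin_history(M, w, rounds):
    """Minimum normalized signed margin after each round of Optimal AdaBoost."""
    m = len(w)
    score = [0.0] * m          # running sum of alpha_t * (+1 or -1) per example
    total_alpha = 0.0
    history = []
    for _ in range(rounds):
        eta = min(M, key=lambda r: err(r, w))
        e = err(eta, w)
        alpha = 0.5 * math.log((1 - e) / e)
        for i in range(m):
            score[i] += alpha * (1 - 2 * eta[i])
        total_alpha += alpha
        history.append(min(s / total_alpha for s in score))
        w = update(eta, w)
    return history

# Toy matrix: hypothesis j is wrong only on example j.
M = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
history = min_margin_history(M, [0.1, 0.2, 0.3, 0.4], 2000)
print(history[9], history[99], history[-1])
```

For this toy matrix no weighting of the rows can give a minimum margin above 1/2, so the printed values are bounded by that number; how close the run gets is an empirical matter.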



5. For all training examples $i$, denote by $\beta_T(i) = \mathrm{margin}_T(x_i)$ the "signed" margin of example $i$. From
our convergence results we can show that $\beta^{\min} = \lim_{T\to\infty} \min_i \beta_T(i)$ exists. We can also show that
$\beta^{\min} = \lim_{T\to\infty} \sum_i w_{T+1}(i)\, \beta_T(i)$. This implies that, for all training examples $i$, $\lim_{T\to\infty} \beta_T(i) > \beta^{\min}$
implies $\lim_{T\to\infty} w_{T+1}(i) = 0$; and that $\lim_{T\to\infty} w_{T+1}(i) > 0$ implies $\lim_{T\to\infty} \beta_T(i) = \beta^{\min}$. Also,
assuming training examples with different outputs, there always exists a pair of different-label examples
$(j, k)$, with $y_j = 1$ (positive example) and $y_k = -1$ (negative example), such that $\lim_{T\to\infty} w_{T+1}(j) > 0$
and $\lim_{T\to\infty} w_{T+1}(k) > 0$ (because the error $\eta_T \cdot w_{T+1} = \tfrac{1}{2}$, where $\eta_T$ is the row of $M$ corresponding
to the dichotomy selected at round $T$). This in turn implies $\lim_{T\to\infty} \beta_T(j) = \lim_{T\to\infty} \beta_T(k) = \beta^{\min}$,
leading to our interpretation of the set $\{ i \mid \lim_{T\to\infty} \beta_T(i) = \beta^{\min} \}$ as the set of "support vectors."
9. Closing Remarks

In this paper we provide a non-constructive proof of existence of an invariant measure, 
provided the dynamics of AdaBoost satisfy certain conditions. An improvement of this 
result would be a direct proof for the existence of such a measure, perhaps by proving that 
AdaBoost always satisfies the conditions. An even stronger result would be a constructive 
proof of such a measure. There are some hints for such a proof lying in the simple nature 
of the inverse of the AdaBoost mapping A. 

While we provide convergence proofs, we do not provide convergence rates. We suspect
that the rate varies significantly between datasets and choice of weak learner. For example,
on datasets where AdaBoost tends to overfit, we suspect the rate of convergence is slower.
On the other hand, the stronger the weak learner, the quicker the rate of convergence seems
to be. This may explain why the generalization error of AdaBoost seems to converge quickly when
using decision trees.

Finally, we would like to say something about the quality of the generalization error,
beyond just that it converges. In all of our experiments involving decision stumps, we
have observed a logarithmic growth of the number of unique hypotheses contained in the
combined AdaBoost classifier as a function of time. Such a logarithmic growth yields a
tighter data-dependent bound on the generalization error of the AdaBoost classifier. We believe
that the distribution of the invariant measure over the regions $\sigma(\eta)$ is an important factor for
this behavior: the relative frequency of selecting each hypothesis seems Gamma distributed.
[Figure 3 plots; horizontal axis: Rounds of Boosting]

Figure 3: Evidence for No Ties. These plots depict the difference between the errors
for the best row and the second best row (log scale) as a function of the number
of rounds of boosting decision stumps on the Heart-Disease (Frank and Asuncion,
2010b) (top), Sonar (Frank and Asuncion, 2010a) (center), and Breast-Cancer
(Frank and Asuncion, 2010a) (bottom) datasets. The behavior depicted
in these plots suggests that Assumption 12 holds. Recall that, as described in
the body of the text, when looking for the second best row at time $t$, we ignore
rows $\eta'$ such that $\sum_{i : \eta_t(i) \neq \eta'(i)} w_t(i) < \epsilon$, where $\epsilon$ is a small dataset-specific
threshold ($\epsilon = 10^{-10}$ for Breast-Cancer), and that we start AdaBoost from
a weight over the training examples drawn uniformly at random from the $m$-simplex.
[Figure 4 plots; panels (a), (b), (c)]

Figure 4: Evidence for the Convergence of Optimal AdaBoost Classifier when Boosting
Decision Stumps on the Breast-Cancer Dataset. Plot (a) shows the behavior of
the "signed" margin $\mathrm{margin}_T(x_i)$ of every example $i$ as a function of the number of
rounds $T$ of boosting (log scale). Plot (b) is a closer look at the asymptotic behavior
of these margins from rounds $T = 90K$ to $100K$. The convergence of the
signed margins is more evident at this resolution. Plot (c) shows the histogram of signed
margins at rounds $T = 1K, 10K, 20K, 40K, 90K, 100K$. The histograms contain 200 bins.
Note that they are all positive because, from the theory of AdaBoost, assuming the weak-learning
hypothesis holds, all the training examples are correctly classified
after some finite number of rounds (logarithmic in $m$), so that the signed margin is
eventually always positive. Note also that the examples in the histogram whose signed margin is
closest to zero correspond to the "support vectors" (see main text for further discussion).
[Figure 5 plot; title: Min. Margin Converging for Optimal AdaBoost with Decision Stumps on Cancer Dataset]

Figure 5: Evidence for the Convergence of the Minimum Margin. This plot depicts
the minimum margin as a function of the number of rounds of boosting (log
scale) on the Breast-Cancer dataset (Frank and Asuncion, 2010a), using decision
stumps. This is an isolation of the minimum margin from Figure 4(c).